Explore and Summarize Data

Introduction:

This project uses exploratory data analysis techniques in R to explore the Red Wine Quality dataset. The Wine Quality description file describes the variables, their meanings, and how the data was collected.

First, let’s take a look at the dataset:

## Observations: 1,599
## Variables: 13
## $ X                    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
## $ fixed.acidity        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, ...
## $ volatile.acidity     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660,...
## $ citric.acid          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06,...
## $ residual.sugar       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2...
## $ chlorides            <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075,...
## $ free.sulfur.dioxide  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15...
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, ...
## $ density              <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0...
## $ pH                   <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30,...
## $ sulphates            <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46,...
## $ alcohol              <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, ...
## $ quality              <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5,...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Observations: As we can see, the dataset has 1,599 observations and 13 columns: a row index (X), 11 numerical input variables, and quality. Quality is an integer score that can be treated as an ordered factor. It ranges from 3 to 8, with a median of 6.
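A minimal sketch of this first look, under stated assumptions. The report does not name the data file, so instead of reading a CSV the sketch rebuilds only the first five rows shown in the listing above. The names `wine` and `quality.ordered` are hypothetical, and base `str()` stands in for whatever function produced the original listing.

```r
# First five rows, copied from the listing above (the full data has 1,599 rows).
wine <- data.frame(
  X                    = 1:5,
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity     = c(0.700, 0.880, 0.760, 0.280, 0.700),
  citric.acid          = c(0.00, 0.00, 0.04, 0.56, 0.00),
  residual.sugar       = c(1.9, 2.6, 2.3, 1.9, 1.9),
  chlorides            = c(0.076, 0.098, 0.092, 0.075, 0.076),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34),
  density              = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978),
  pH                   = c(3.51, 3.20, 3.26, 3.16, 3.51),
  sulphates            = c(0.56, 0.68, 0.65, 0.58, 0.56),
  alcohol              = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality              = c(5L, 5L, 5L, 6L, 5L)
)

str(wine)      # column types and leading values
summary(wine)  # min, quartiles, mean, and max for each column

# Quality as an ordered factor over the 3..8 range reported above.
# Kept as a separate vector so the data frame itself is left unchanged.
quality.ordered <- factor(wine$quality, levels = 3:8, ordered = TRUE)
quality.ordered
```

Declaring all six levels up front keeps scores that happen to be absent from a subset (here 3, 4, 7, and 8) in later tables and plots.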

Univariate Plots Section

This part examines each variable on its own, using a numerical summary and a plot.

Quality

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

Total sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Free sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Residual sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Citric acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Volatile acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Fixed Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90
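The summaries above could be produced along the following lines. This is only a sketch: the vectors hold the leading values visible in the dataset listing at the top of the report, not the full columns, and the variable names are my own.

```r
# Leading values copied from the dataset listing (not the full columns).
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5)

# Six-number summary, same layout as the blocks above.
summary(alcohol)

# Histogram bins without drawing; drop plot = FALSE to see the plot itself.
alcohol.hist <- hist(alcohol, plot = FALSE)
alcohol.hist$counts

# Quality takes only a few distinct values, so a frequency table
# (or a bar chart of it) is more natural than a histogram.
quality.counts <- table(quality)
quality.counts
```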

Univariate Analysis

What is the structure of your dataset?

There are 1,599 observations. Each observation is described by 11 features: fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, and alcohol.

What is/are the main feature(s) of interest in your dataset?

The main feature in the data set is the quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I think alcohol content, pH and total acidity level (volatile.acidity, fixed.acidity, citric.acid) will determine quality.

Did you create any new variables from existing variables in the dataset?

No, I didn’t create a new variable in the data set.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

There are no unusual distributions and no missing attribute values. The dataset is already tidy, so there was no need to change the form of the data.
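A check for missing values can be sketched as below. The data frame holds only the first five rows of three columns from the listing at the top of the report, so it illustrates the check rather than reproducing the result for the full dataset.

```r
# First five rows of three columns, copied from the dataset listing.
wine <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9),
  chlorides      = c(0.076, 0.098, 0.092, 0.075, 0.076),
  quality        = c(5L, 5L, 5L, 6L, 5L)
)

# Number of missing values per column; all zeros means nothing to impute.
missing.per.column <- colSums(is.na(wine))
missing.per.column

# TRUE when every row is fully populated.
all.rows.complete <- all(complete.cases(wine))
all.rows.complete
```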

Bivariate Plots Section

Higher citric acid is associated with higher quality.

Residual sugar appears to have little impact on the quality of red wines.

pH appears to have little impact on the quality of red wines.

Alcohol content appears to have a high impact on the quality of red wines.

Sulphates appear to have a high impact on the quality of red wines.

Quality tends to increase as fixed acidity increases.

Density tends to increase as residual sugar increases.

Density decreases as alcohol content increases.
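One way to compare a feature across quality scores is a per-score summary. The sketch below uses only the first eight alcohol and quality values visible in the dataset listing, so it shows the technique and not the full-data result.

```r
# First eight values of each column, copied from the dataset listing.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7)

# Median alcohol within each quality score.
median.alcohol <- tapply(alcohol, quality, median)
median.alcohol

# With the full data, a box plot per score shows the same comparison:
# boxplot(alcohol ~ quality, xlab = "quality", ylab = "alcohol")
```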

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

I found that quality is positively correlated with citric.acid, fixed.acidity, sulphates, and alcohol content. Residual sugar and density are also positively correlated.
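Correlation coefficients like these can be computed with `cor()`. The sketch below is an illustration on the handful of leading values visible in the dataset listing, so the numbers it prints are not the coefficients for the full data. Pearson is R's default method; the Spearman variant is my own suggestion for an ordinal score such as quality, not something the analysis above states it used.

```r
# Leading values copied from the dataset listing (not the full columns).
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7)
density <- c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978)  # only five values shown

# Pearson correlation between alcohol and quality.
r.alcohol.quality <- cor(alcohol, quality)
r.alcohol.quality

# Rank-based alternative, often preferred when one variable is ordinal.
rho.alcohol.quality <- cor(alcohol, quality, method = "spearman")
rho.alcohol.quality

# Density against alcohol, using the five rows where both are visible.
r.density.alcohol <- cor(density, alcohol[1:5])
r.density.alcohol
```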

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

I found that pH and residual sugar have little impact on the quality of red wines.

It is interesting to see how density relates to alcohol and residual sugar content.

What was the strongest relationship you found?

The strongest relationship I found in this dataset was between quality and alcohol content.

Multivariate Plots Section

Correlation Matrix

## corrplot 0.84 loaded
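A correlation matrix of this kind can be sketched as follows. The data frame holds only the first five rows of six columns from the dataset listing, so the values are illustrative. The plotting call assumes the corrplot package is installed and is skipped otherwise; the `method = "circle"` choice is my assumption, not necessarily the setting used for the original figure.

```r
# First five rows of six columns, copied from the dataset listing.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity = c(0.700, 0.880, 0.760, 0.280, 0.700),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality          = c(5, 5, 5, 6, 5)
)

# Pairwise Pearson correlations between all columns.
cor.matrix <- cor(wine)
round(cor.matrix, 2)

# Draw the matrix only if corrplot is available.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor.matrix, method = "circle")
}
```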

The quality of the wines appears to vary with pH: wines with a pH between 3.2 and 3.6 tended to be of better quality.

High-quality wines contain high levels of sulphates and alcohol.

High-quality wines contain high levels of citric acid and alcohol.

High-quality wines combine high alcohol with low volatile acidity.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

- We have seen how alcohol and volatile acidity relate to quality.

- High-quality wines combine high alcohol with low volatile acidity; this combination yields the best-rated wines.

- High-quality wines also contain high levels of alcohol and citric acid.

- Likewise, high levels of sulphates and alcohol are found in the best wines.
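The joint effect of two features on quality can be summarised numerically by splitting each feature into low and high halves and averaging quality in every cell. This is a sketch of that idea, not the method behind the plots above: the median split and all variable names are my own assumptions, and the vectors hold only the first six values visible in the dataset listing.

```r
# First six values of each column, copied from the dataset listing.
volatile.acidity <- c(0.700, 0.880, 0.760, 0.280, 0.700, 0.660)
alcohol          <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4)
quality          <- c(5, 5, 5, 6, 5, 5)

# Split each feature at its median into "low" and "high".
alcohol.level <- ifelse(alcohol > median(alcohol), "high", "low")
acidity.level <- ifelse(volatile.acidity > median(volatile.acidity),
                        "high", "low")

# Mean quality in each alcohol x volatile acidity cell.
# Cells with no wines come back as NA.
mean.quality <- tapply(
  quality,
  list(alcohol = alcohol.level, volatile.acidity = acidity.level),
  mean
)
mean.quality
```

On the full data, the same table would show whether the high-alcohol, low-volatile-acidity cell really has the highest mean quality.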

Final Plots and Summary

Plot One

Description One

From the scatterplot, we can see that quality increases as alcohol content increases. The two have a strong, positive relationship.

Plot Two

Description Two

As the histogram shows, wine quality tends to increase as sulphates increase.

Plot Three

Description Three

As fixed acidity decreases, pH increases. The relationship between the two is negative.
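The direction of this relationship can be checked with a simple linear fit. The sketch below uses only the first seven fixed acidity and pH values visible in the dataset listing, so the fitted numbers are illustrative; the choice of a straight-line model is my assumption.

```r
# First seven values of each column, copied from the dataset listing.
fixed.acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9)
pH            <- c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30)

# Least-squares line of pH on fixed acidity.
fit <- lm(pH ~ fixed.acidity)

# A negative slope means pH falls as fixed acidity rises.
slope <- coef(fit)[["fixed.acidity"]]
slope

# With the full data, the points and fitted line can be drawn with:
# plot(fixed.acidity, pH); abline(fit)
```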

Reflection

The red wine dataset contains information on 1,599 red wines. Using R, I tried to get a sense of which factors might affect the quality of a wine. I found that high-quality wines combine high alcohol with low volatile acidity, and that this combination yields the best wines. The relationship of alcohol and citric acid with quality was positive and strong. An interesting observation in this dataset was the relationship of alcohol and residual sugar with the density of the wines.